In this practical, we are first going to refresh our knowledge of (or get acquainted with) Python in Google Colab, and then we will continue with text mining and regular expressions! Are you looking for Python documentation to refresh your knowledge of the language? If so, you can check https://docs.python.org/3/reference/
Google Colab¶
Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:
- Zero configuration required
- Free access to GPUs and more
- Easy sharing
Colab notebooks are Jupyter notebooks that are hosted by Colab. You can find a more detailed introduction to Colab here, but we will also cover the basics below.
Simple text processing¶
1. Open Colab and create a new empty notebook to work with Python 3!
Go to https://colab.research.google.com/ and login with your account. Then click on "File $\rightarrow$ New notebook".
If you want to insert a new code chunk below the cell you are currently in, press Alt + Enter (Option + Enter on Mac).
If you want to stop your code from running in Colab:
- Interrupt execution by pressing Ctrl + M I, or simply click the stop button.
- Or: press Ctrl + A to select all the code of that particular cell, then Ctrl + X to cut it. Now the cell is empty and can be deleted with Ctrl + M D or the delete button. You can then paste your code into a new code chunk and adjust it.
NB: On Mac, use Cmd instead of Ctrl in these shortcuts.
2. Text is also known as a string variable, or as an array of characters. Create a variable a with the text value "Hello @Text Mining World! I'm here to learn, right?", and then print it!
a = "Hello @Text Mining World! I'm here to learn, right?"
a
"Hello @Text Mining World! I'm here to learn, right?"
3. Print the first and last character of your variable.
print(a[0])   # without print(), only the last expression in the cell is displayed
print(a[31])  # note: hard-coding an index like this does not give the last character
l = len(a)
print("Length of your string is: ", l)
print(a[l-1]) # the last character is at index len(a) - 1
H
e
Length of your string is:  51
?
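Python also supports negative indices, which count from the end of the string; a[-1] is a handy alternative to a[len(a)-1]. A small sketch:

```python
a = "Hello @Text Mining World! I'm here to learn, right?"

# Negative indices count from the end: -1 is the last character
last_char = a[-1]   # same as a[len(a) - 1]
last_word = a[-6:]  # slicing works too: the last six characters
print(last_char, last_word)
```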
4. Use the !pip install command to install the packages numpy, nltk, gensim, and spacy.
NB: The re package (for regular expressions) is part of Python's standard library and comes pre-installed. You don't need to run !pip install re; you can simply import it and use it directly in your code.
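For instance, importing re and running a quick pattern over our example string works straight away (the @-mention pattern below is just an illustration):

```python
import re

a = "Hello @Text Mining World! I'm here to learn, right?"

# '@' followed by one or more word characters
mentions = re.findall(r'@\w+', a)
print(mentions)  # ['@Text']
```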
NB: Generally, you only need to install a package once on your computer and then just import it when needed. In Colab, however, you may need to reinstall packages after your runtime is reset or you reconnect.
!pip install -q numpy
!pip install -q nltk
!pip install -q gensim
!pip install -q spacy
5. Import (load) the nltk package, then use the string method lower() to convert the characters in string a to their lowercase form and save the result in a new variable b.
import nltk
b = a.lower()
b
"hello @text mining world! i'm here to learn, right?"
NB: nltk comes with many corpora, toy grammars, trained models, etc. A complete list is posted at https://www.nltk.org/nltk_data/. To install the data, after installing nltk, you can use the nltk.download() data downloader. We will make use of this in Question 8.
6. Use the string package to print the list of punctuation characters.
Punctuation marks can separate characters, words, phrases, or sentences. In some applications they are very important to the task at hand; in others they are redundant and should be removed! We will learn more about this in text pre-processing.
import string
print(string.punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
7. Use the punctuation list to remove the punctuation from the lowercase form of our example string a. Name your variable c.
# Remember there are many ways to remove punctuations! This is only one of them:
c = "".join([char for char in b if char not in string.punctuation])
print(c)
hello text mining world im here to learn right
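An equivalent and often faster idiom uses str.translate with a table built by str.maketrans; a sketch reproducing the result above:

```python
import string

b = "hello @text mining world! i'm here to learn, right?"

# Map every punctuation character to None, dropping them in one pass
c = b.translate(str.maketrans("", "", string.punctuation))
print(c)  # hello text mining world im here to learn right
```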
8. Use the function RegexpTokenizer() from nltk to tokenize the string b whilst removing punctuation (tokenization is the process of splitting text into smaller units, such as words, sentences, or subwords; we'll talk more about this next week).
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize(b)
['hello', 'text', 'mining', 'world', 'i', 'm', 'here', 'to', 'learn', 'right']
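Note that \w+ splits "i'm" into "i" and "m". RegexpTokenizer is essentially a wrapper around re.findall, so you can prototype richer patterns with the standard library first, e.g. keeping contractions intact (the pattern below is illustrative):

```python
import re

b = "hello @text mining world! i'm here to learn, right?"

# Try an apostrophe-joined pair first (i'm), otherwise a plain word
tokens = re.findall(r"\w+'\w+|\w+", b)
print(tokens)
```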
Working with text datasets¶
Working with a text dataset is similar to simple text processing. Many websites offer publicly available text datasets you can practice on.
We want to analyze the Taylor Swift song lyrics data from all her albums. Download the dataset from the course website or alternatively from Kaggle.
Upload taylor_swift_lyrics.csv
to Google Colab. You can do this by clicking on the Files button on the very left side of Colab and drag and drop the data there or click the upload button. Alternatively you can mount Google Drive and upload the dataset there.
Taylor Swift Lyrics dataset¶
9. Read the taylor_swift_lyrics.csv dataset. Check the dataframe using the head() and tail() functions and the iloc attribute.
import pandas as pd
ts_lyrics = pd.read_csv("data/taylor_swift_lyrics.csv")
ts_lyrics.head()
  | Artist | Album | Title | Lyrics
---|---|---|---|---
0 | Taylor Swift | Taylor Swift | Tim McGraw | He said the way my blue eyes shinx\nPut those ... |
1 | Taylor Swift | Taylor Swift | Picture to Burn | State the obvious, I didn't get my perfect fan... |
2 | Taylor Swift | Taylor Swift | Teardrops on my Guitar | Drew looks at me,\nI fake a smile so he won't ... |
3 | Taylor Swift | Taylor Swift | A Place in This World | I don't know what I want, so don't ask me\n'Ca... |
4 | Taylor Swift | Taylor Swift | Cold As You | You have a way of coming easily to me\nAnd whe... |
ts_lyrics.tail()
  | Artist | Album | Title | Lyrics
---|---|---|---|---
127 | Taylor Swift | folklore | mad woman | What did you think I'd say to that?\nDoes a sc... |
128 | Taylor Swift | folklore | epiphany | Keep your helmet\nKeep your life, son\nJust a ... |
129 | Taylor Swift | folklore | betty | Betty, I won't make assumptions about why you ... |
130 | Taylor Swift | folklore | peace | Our coming of age has come and gone\nSuddenly ... |
131 | Taylor Swift | folklore | hoax | My only one\nMy smoking gun\nMy eclipsed sun\n... |
ts_lyrics.iloc[0]
Artist                                         Taylor Swift
Album                                          Taylor Swift
Title                                            Tim McGraw
Lyrics    He said the way my blue eyes shinx\nPut those ...
Name: 0, dtype: object
ts_lyrics.head(1)
  | Artist | Album | Title | Lyrics
---|---|---|---|---
0 | Taylor Swift | Taylor Swift | Tim McGraw | He said the way my blue eyes shinx\nPut those ... |
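As a quick reminder of the difference between positional and label-based indexing (a toy frame for illustration, not the lyrics data):

```python
import pandas as pd

# Toy frame with non-default index labels
df = pd.DataFrame({"Title": ["Tim McGraw", "betty"]}, index=[10, 20])

by_position = df.iloc[0]["Title"]  # iloc is purely positional
by_label = df.loc[20, "Title"]     # loc uses the index labels
print(by_position, by_label)
```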
10. Use the str.contains function and write a regex that finds rows where the lyrics contain a specific word, such as "love".
import re
from IPython.display import display
love_lyrics = ts_lyrics[ts_lyrics['Lyrics'].str.contains(r'\blove\b')]
display(love_lyrics)
  | Artist | Album | Title | Lyrics
---|---|---|---|---
1 | Taylor Swift | Taylor Swift | Picture to Burn | State the obvious, I didn't get my perfect fan... |
2 | Taylor Swift | Taylor Swift | Teardrops on my Guitar | Drew looks at me,\nI fake a smile so he won't ... |
6 | Taylor Swift | Taylor Swift | Tied Together With A Smile | Seems the only one who doesn't see your beauty... |
7 | Taylor Swift | Taylor Swift | Stay Beautiful | Cory's eyes are like a jungle\nHe smiles; it's... |
9 | Taylor Swift | Taylor Swift | Mary’s Song | She said\n"I was seven, and you were nine\nI l... |
... | ... | ... | ... | ... |
122 | Taylor Swift | folklore | seven | Please picture me\nIn the trees\nI hit my peak... |
123 | Taylor Swift | folklore | august | Salt air\nAnd the rust on your door\nI never n... |
129 | Taylor Swift | folklore | betty | Betty, I won't make assumptions about why you ... |
130 | Taylor Swift | folklore | peace | Our coming of age has come and gone\nSuddenly ... |
131 | Taylor Swift | folklore | hoax | My only one\nMy smoking gun\nMy eclipsed sun\n... |
62 rows × 4 columns
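Note that the pattern above is case-sensitive, so a line like "Love story" would be missed; str.contains accepts case=False (or flags=re.IGNORECASE). A toy sketch, not the lyrics data:

```python
import pandas as pd

toy = pd.Series(["Love story", "hate", "my love"])

strict = toy.str.contains(r'\blove\b').tolist()
loose = toy.str.contains(r'\blove\b', case=False).tolist()
print(strict)  # [False, False, True]
print(loose)   # [True, False, True]
```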
11. Use the str.count function and write a regex that counts how many times the word "love" appears in each lyric.
ts_lyrics['love_count'] = ts_lyrics['Lyrics'].str.count(r'\blove\b', flags=re.IGNORECASE)
print(ts_lyrics[['Lyrics', 'love_count']])
                                                Lyrics  love_count
0    He said the way my blue eyes shinx\nPut those ...           0
1    State the obvious, I didn't get my perfect fan...           2
2    Drew looks at me,\nI fake a smile so he won't ...           2
3    I don't know what I want, so don't ask me\n'Ca...           0
4    You have a way of coming easily to me\nAnd whe...           0
..                                                 ...         ...
127  What did you think I'd say to that?\nDoes a sc...           0
128  Keep your helmet\nKeep your life, son\nJust a ...           0
129  Betty, I won't make assumptions about why you ...           1
130  Our coming of age has come and gone\nSuddenly ...           2
131  My only one\nMy smoking gun\nMy eclipsed sun\n...           2

[132 rows x 2 columns]
12. Write a regex that extracts all words that are exactly 4 characters long in each lyric.
ts_lyrics['four_letter_words'] = ts_lyrics['Lyrics'].str.findall(r'\b\w{4}\b')
print(ts_lyrics[['Lyrics', 'four_letter_words']])
                                                Lyrics  \
0    He said the way my blue eyes shinx\nPut those ...
1    State the obvious, I didn't get my perfect fan...
2    Drew looks at me,\nI fake a smile so he won't ...
3    I don't know what I want, so don't ask me\n'Ca...
4    You have a way of coming easily to me\nAnd whe...
..                                                 ...
127  What did you think I'd say to that?\nDoes a sc...
128  Keep your helmet\nKeep your life, son\nJust a ...
129  Betty, I won't make assumptions about why you ...
130  Our coming of age has come and gone\nSuddenly ...
131  My only one\nMy smoking gun\nMy eclipsed sun\n...

                                     four_letter_words
0    [said, blue, eyes, that, said, That, Just, Tha...
1    [didn, love, more, than, ever, love, tell, you...
2    [Drew, fake, What, want, what, need, that, Tha...
3    [know, what, want, know, what, down, this, roa...
4    [have, when, take, take, very, best, need, fee...
..                                                 ...
127  [What, that, Does, when, back, They, kill, kno...
128  [Keep, your, Keep, your, life, Just, Here, you...
129  [make, your, time, when, your, like, from, Ine...
130  [come, gone, this, long, near, just, give, fir...
131  [only, This, down, This, Give, Your, love, onl...

[132 rows x 2 columns]
13. Write a regex that finds rows where the lyrics contain any numeric characters.
lyrics_with_numbers = ts_lyrics[ts_lyrics['Lyrics'].str.contains(r'\d')]
print(lyrics_with_numbers['Lyrics'])
44     I still remember the look on your face\nLit th...
63     I said, "Oh my, what a marvelous tune"\nIt was...
66     You said it in a simple way\n4AM, the second d...
74     It's 2 A.M. in your car\nWindows down, I pass ...
84     I wanna be your endgame\nI wanna be your first...
89     See you in the dark\nAll eyes on you, my magic...
103    I think he knows his footprints\nOn the sidewa...
111    You are somebody that I don't know\nBut you're...
126    Green was the color of the grass where I used ...
Name: Lyrics, dtype: object
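If you also want to extract the digits themselves rather than just flag the rows, re.findall with \d+ does it; an illustrative sketch on one of the matching lyric snippets:

```python
import re

line = "It's 2 A.M. in your car\nWindows down, I pass my old apartment"

# \d+ matches one or more consecutive digits
numbers = re.findall(r'\d+', line)
print(numbers)  # ['2']
```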
Computer review dataset¶
The Computer Review Dataset is an annotated dataset for aspect-based sentiment analysis. The data originates from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html, while you can download a version of it from the course website (https://textminingcourse.nl/labs/week_1/data.zip).
14. Use the readlines function to read data from the computer.txt file. Convert the data to a dataframe and name it computer_531.
# Read the file line by line into a list
with open("data/computer.txt", "r", encoding="utf-8") as file:
    computer_data = file.readlines()

# Remove trailing newline characters
computer_data = [line.strip() for line in computer_data]

# Convert to a DataFrame
computer_531 = pd.DataFrame(computer_data, columns=["text"])
print(computer_531)
                                                  text
0    ## I purchased this monitor because of budgeta...
1    inexpensive[+1][a] ## This item was the most i...
2    monitor[-1] ## My overall experience with this...
3    screen[-1], picture quality[-1] ## When the sc...
4    monitor[-1], picture quality[-1] ## I 've view...
..                                                 ...
526           ## After that , it worked like a champ .
527                        ## No problems whatsoever .
528  incompatibility[-1] ## My only grips are the i...
529  ## This is a well know problem with the PCs an...
530  ## Also , the only hard button controls you ge...

[531 rows x 1 columns]
15. In this dataset, each line represents a review along with its annotated aspects and sentiments. For example, line 3 is "screen[-1], picture quality[-1] ## review text". Examining this line shows that the annotator thinks this review has two aspects/features, screen and picture quality, both associated with a sentiment score of negative one. The annotation is followed by the characters ## and then the actual review text. What we want to do now is write regular expressions to create a bit of structure for our data: 1) extract all the aspects and put them into a column, 2) put the review text in a column, and 3) sum the sentiment scores and, in another column, give the whole review a positive, negative, or neutral sentiment label based on the sign of the summed value. Write the regular expressions step by step with decent code documentation.
# Extract aspect/sentiment pairs, e.g. "screen[-1]" -> ("screen", "-1")
def extract_aspects(line):
    return re.findall(r"(\w+)\[([-+]?\d+)\]", line)

computer_531['aspects_and_sentiments'] = computer_531['text'].apply(extract_aspects)

# Extract the review text
def extract_review_text(line):
    # Match everything after '##'
    match = re.search(r'##\s*(.*)', line)
    return match.group(1) if match else None

computer_531['review_text'] = computer_531['text'].apply(extract_review_text)

# Keep only the aspect names
def get_aspects(aspects_and_sentiments):
    return [aspect for aspect, sentiment in aspects_and_sentiments]

computer_531['aspects'] = computer_531['aspects_and_sentiments'].apply(get_aspects)

# Sum the sentiment scores per review
def sum_sentiments(aspects_and_sentiments):
    return sum(int(sentiment) for _, sentiment in aspects_and_sentiments)

computer_531['summed_sentiment'] = computer_531['aspects_and_sentiments'].apply(sum_sentiments)

# Assign a label based on the sign of the summed score
def assign_sentiment_label(summed_sentiment):
    if summed_sentiment > 0:
        return 'Positive'
    elif summed_sentiment < 0:
        return 'Negative'
    else:
        return 'Neutral'

computer_531['sentiment_label'] = computer_531['summed_sentiment'].apply(assign_sentiment_label)
# Display the final dataframe
display(computer_531)
  | text | aspects_and_sentiments | review_text | aspects | summed_sentiment | sentiment_label
---|---|---|---|---|---|---
0 | ## I purchased this monitor because of budgeta... | [] | I purchased this monitor because of budgetary ... | [] | 0 | Neutral |
1 | inexpensive[+1][a] ## This item was the most i... | [(inexpensive, +1)] | This item was the most inexpensive 17 inch mon... | [inexpensive] | 1 | Positive |
2 | monitor[-1] ## My overall experience with this... | [(monitor, -1)] | My overall experience with this monitor was ve... | [monitor] | -1 | Negative |
3 | screen[-1], picture quality[-1] ## When the sc... | [(screen, -1), (quality, -1)] | When the screen was n't contracting or glitchi... | [screen, quality] | -2 | Negative |
4 | monitor[-1], picture quality[-1] ## I 've view... | [(monitor, -1), (quality, -1)] | I 've viewed numerous different monitor models... | [monitor, quality] | -2 | Negative |
... | ... | ... | ... | ... | ... | ... |
526 | ## After that , it worked like a champ . | [] | After that , it worked like a champ . | [] | 0 | Neutral |
527 | ## No problems whatsoever . | [] | No problems whatsoever . | [] | 0 | Neutral |
528 | incompatibility[-1] ## My only grips are the i... | [(incompatibility, -1)] | My only grips are the incompatibility with XP ... | [incompatibility] | -1 | Negative |
529 | ## This is a well know problem with the PCs an... | [] | This is a well know problem with the PCs and A... | [] | 0 | Neutral |
530 | ## Also , the only hard button controls you ge... | [] | Also , the only hard button controls you get a... | [] | 0 | Neutral |
531 rows × 6 columns
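Notice in the output that "picture quality" was captured as just "quality": the (\w+) group cannot span a space. A hedged variant of the pattern that also allows multi-word aspect names:

```python
import re

line = "screen[-1], picture quality[-1] ## When the screen was n't contracting ..."

# Allow aspect names made of several space-separated words before [...]
pattern = r"(\w+(?: \w+)*)\[([-+]?\d+)\]"
matches = re.findall(pattern, line)
print(matches)  # [('screen', '-1'), ('picture quality', '-1')]
```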
16. Save the final computer_531 dataframe to a CSV file. We will be using it in later parts of the course.
# Save the dataframe to a CSV file
computer_531.to_csv("data/computer_531_final.csv", index=False)
print("CSV has been saved.")
CSV has been saved.